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Abstract 

Graphical techniques for modeling the dependencies of random variables have been explored in a variety 
of different areas including statistics, statistical physics, artificial intelligence, speech recognition, image 
processing, and genetics. Formalisms for manipulating these models have been developed relatively 
independently in these research communities. In this paper we explore hidden Markov models (HMMs) 
and related structures within the general framework of probabilistic independence networks (PINs). The 
paper contains a self-contained review of the basic principles of PINs. It is shown that the well-known 
forward-backward (F-B) and Viterbi algorithms for HMMs are special cases of more general inference 
algorithms for arbitrary PINs. Furthermore, the existence of inference and estimation algorithms for 
more general graphical models provides a set of analysis tools for HMM practitioners who wish to explore 
a richer class of HMM structures. Examples of relatively complex models to handle sensor fusion and 
coarticulation in speech recognition are introduced and treated within the graphical model framework 
to illustrate the advantages of the general approach. 
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1 Introduction 


For multivariate statistical modeling applications, such as hidden Markov modeling for speech 
recognition, the identification and manipulation of relevant conditional independence assump¬ 
tions can be a useful tool for model-building and analysis. There has recently been a consider¬ 
able amount of work exploring the relationships between conditional independence in probability 
models and structural properties of related graphs. In particular, the separation properties of 
a graph can be directly related to conditional independence properties in a set of associated 
probability models. 

The key point of this paper is that the analysis and manipulation of HMMs can be facilitated 
by exploiting the relationship between probability models and graphs. The major advantages 
to be gained are in: 

• Model Description: A graphical model provides a natural and intuitive medium for dis¬ 
playing dependencies which exist between random variables. In particular, the structure of 
the graphical model clarifies the conditional independencies in the associated probability 
models, allowing model assessment and revision. 

• Computational Efficiency: The graphical model is a powerful basis for specifying efficient 
algorithms for computing quantities of interest in the probability model, e.g., calculation 
of the probability of observed data given the model. These inference algorithms can be 
specified automatically once the initial structure of the graph is determined. 

We will refer to both probability models and graphical models. Each consists of structure 
and parameters. The structure of the model consists of the specification of a set of conditional 
independence relations for the probability model, or a set of (missing) edges in the graph for the 
graphical model. The parameters of both the probability and graphical models consist of the 
specification of the joint probability distribution: in factored form for the probability model and 
defined locally on the nodes of the graph in the graphical model. The inference problem is that 
of the calculation of posterior probabilities of variables of interest given observable data and 
given a specification of the probabilistic model. The related task of MAP identification is the 
determination of the most likely state of a set of unobserved variables, given observed variables 
and the probabilistic model. The learning or estimation problem is that of determining the 
parameters (and possibly structure) of the probabilistic model from data. 

This paper reviews the applicability and utility of graphical modeling to HMMs. Section 2 
introduces the basic notation for probability models and associated graph structures. Section 3 
summarizes relevant results from the literature on probabilistic independence networks (or 
PINs for short), in particular, the relationships which exist between separation in a graph and 
conditional independence in a probability model. Section 4 interprets the standard first-order 
HMM in terms of PINs. In Section 5 the standard algorithm for inference in a directed PIN is 
discussed and applied to the standard HMM in Section 6. A result of interest is that the F-B and 
Viterbi algorithms are shown to be special cases of this inference algorithm. Section 7 shows 
that the inference algorithms for undirected PINs are essentially the same as those already 
discussed for directed PINs. Section 8 introduces more complex HMM structures for speech 
modeling and analyzes them using the graphical model framework. Section 9 reviews known 
estimation results for graphical models and discusses their potential implications for practical 
problems in the estimation of HMM structures, and Section 10 contains summary remarks. 
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2 Notation and Background 


Let U = {X 1 ,X 2 , ..., Xn} represent a set of discrete-valued random variables. For the purposes 
of this paper we restrict our attention to discrete-valued random variables, however, many of the 
results stated generalize directly to continuous and mixed sets of random variables (Lauritzen 
and Wermuth 1989; Whittaker 1990). Let lower case x\ denote one of the values of variable X\: 
the notation is taken to mean the sum over all possible values of X\. Let p(ay) be shorthand 
for the particular probability p(Xj = ay), whereas p(Xj ) represents the probability function for 
X{ (a table of values, since X{ is assumed discrete), 1 < i < N. The full joint distribution 
function is p(U) = (Xi, X 2 , ..., Xjv) and p( u) = (x^, x 2 ,..., xjy) denotes a particular value 
assignment for U. 

If A, B and C are disjoint sets of random variables, the conditional independence rela¬ 
tion A T B\C is defined such that that A is independent of B given C, i.e., p(A,B\C) = 
p(A\C)p(B\C). Conditional independence is symmetric. Note also that marginal independence 
(no conditioning) does not in general imply conditional independence, nor does conditional 
independence in general imply marginal independence (Whittaker 1990). 

With any set of random variables U we can associate a graph G defined as G = (V,E). V 
denotes the set of vertices or nodes of the graph such that there is a one-to-one mapping between 
the nodes in the graph and the random variables, i.e., V = {Xi,X 2 ,.. . ,Xjv}- E denotes the 
set of edges, {e(z, j)}, where i and j are shorthand for the nodes X; and Xj, 1 < i,j < N. 
Edges of the form e(i, i) are not of interest and thus are not allowed in the graphs discussed in 
this paper. 

If the edges are ordered such that e(i,j ) means that the edge is directed from node i to 
node j, i is a parent of its child j. An ancestor of node i is a node which has as a child either i 
or another ancestor of i. A subset of nodes A is an ancestral set if it contains its own ancestors. 
A descendant of i is a either a child of i or a child of a descendant of i. 

Two nodes i and j are adjacent in G if E contains the edge e(i,j). A path is a sequence of 
distinct nodes {1,..., to} such that there exists an edge for each pair of nodes {/, / + 1} on the 
path. A graph is singly-connected if there exists only one path between any two nodes in the 
graph. A cycle is a path such the beginning and ending nodes on the path are the same. A 
directed cycle is a cycle of directed edges which all point in the same direction. 

If E contains only undirected edges then the graph G is an undirected graph (UG). If E 
contains only directed edges and no directed cycles, then G is an acyclic directed graph (ADG). 
If E contains a mixture of directed and undirected edges, then it is referred to as a mixed or 
chain graph. We note in passing that there exists a theory for graphical independence models 
involving mixed graphs (Whittaker 1990) but mixed graphs will not be discussed further in this 
paper. 

For an UG G, a subset of nodes C separates two other subsets of nodes A and B if every 
path joining every pair of nodes i £ A and j £ B contains at least one node from C. For ADGs 
and mixed graphs analagous, but somewhat more complicated, separation properties exist. 

A cycle is chordless if no other than successive pairs of nodes in the cycle are adjacent. A 
graph G is triangulated if and only if the only chordless cycles in the graph contain no more 
than three nodes. Thus, if one can find a chordless cycle of length four or more, G is not 
triangulated. 

A graph G is complete if there are edges between all pairs of nodes. The digues of G are 
the largest subgraphs of G which are complete. A digue tree for G is a tree of cliques such that 
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Figure 1: An example of a jg P IN structure G which captures a particular set of condi¬ 
tional independence relationships among the set of variables {A'i,..., A^}. For example, 
V;, . ! V,. V 2 . V,. V„| { V;-.}. 


there is a one-to-one node correspondence between the cliques of G and the nodes of the tree. 


3 Probabilistic Independence Networks 

We briefly review the relation between a probability model p(U) = p( A'i,..., A'jv) and a prob¬ 
abilistic independence network structure G = (V,E). The results in this section are largely 
summarized versions of material in Pearl (1988) and Whittaker (1990) . 

A probabilistic independence network structure (PIN structure) is a graphical statement 
of a set of conditional independence relations for a set of random variables U. Absence of an 
edge e(i,j) in G implies some independence relation between A'; and Xj. Thus, a PIN structure 
G is a particular way of specifying the independence relationships present in the probability 
model p(U). We say that G implies a set of probability models p(U), denoted as Vg-, he., 
p(U) E Vg- In the reverse direction, a particular model p(U) embodies a particular set of 
conditional independence assumptions which may or may not be representable in a consistent 
graphical form. One can derive all of the conditional independence properties and inference 
algorithms of interest for U without reference to graphical models. However, as has been 
emphasized in the statistical and AI literature, and is reiterated in this paper in the context 
of hidden Markov models, there are distinct advantages to be gained from using tli& graphical 
formalism. 


3.1 Undirected Probabilistic Independence Networks (UPINs) 

A UPIN is composed of both a JfPIN structure and UPIN parameters. A UPIN structure 
specifies a set of conditional independence relations for a probability model in the form of an 
undirected graph. UPIN parameters consist of numerical specifications of a particular probabil¬ 
ity model consistent with the UPIN structure. Terms used in the literature to described UPINs 
of one form or another include Markov random fields, Markov networks, Boltzmann machines, 
and log-linear models. 
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Figure 2: A triangulated version of the UPIN structure G from Figure 1. 


3.1.1 Conditional Independence Semantics of UPIN Structures 

Let A, B , and S be any disjoint subsets of nodes in an undirected graph (UCf) G. G is an 
undirected probabilistic independence network structure (UPIN structure) for p(U) if for any 
A, B , and S such that S separates A and B in the conditional independence relation A _L B\S 
holds in p(U). The set of all conditional independence relations implied by separation in G 
constitute the (global) Markov properties of G. Figure 1 shows a simple example of a UPIN 
structure for 6 variables. 

Thus, separation in the UPIN structure implies conditional independence in the probability 
model, i.e., it constrains p(U) to belong to a set of probability models Vq which obey the 
Markov properties of the graph. Note that a complete UG is trivially a UPIN structure for 
any p(U) in the sense that there are no constraints on p(U). G is a perfect undirected map 
for p if G is a UPIN structure for p and all the conditional independence relations present 
in p are represented by separation in G. For many probability models p there are no perfect 
undirected maps. A weaker condition is that a UPIN structure G is minimal for a probability 
model p(U) if the removal of any edge from G implies an independence relation which is not 
present in the model p(U), i.e., the structure without the edge is no longer a UPIN structure for 
p(U). Minimality is not equivalent to perfection (for UPIN structures) since, for example, there 
exist probability models with independencies which can not be represented as UPINs except 
for the complete UPIN structure. For example, if A” and Y are marginally independent, but 
conditionally dependent given Z (see Figure 4(a) for an example), then the complete graph is 
the minimal UPIN structure for {A ,Y,Z} but it is not perfect because of the presence of an 
edge between A” and Y. 

3.1.2 Probability Functions on UPIN structures 

Given a UPIN structure the joint probability distribution for U can be expressed as a simple 
factorization: 

P( u) = p(x r, .. .,.rjv) = II«c(*c) ( 1 ) 

U 

where Vq is the set of cliques of xq represents a value assignment for the variables in a 
particular clique C, and the ac(xc) are 11011 -negative clique functions. The clique functions 
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represent the particular parameters associated with the UPIN structure. This corresponds 
directly to the standard definition of a Markov random field (Isham 1981). The clique functions 
reflect the relative “compatibility” of the value assignments in the clique. 

A model p is said to be decomposable if it has a minimal UPIN structure G which is trian¬ 
gulated (Figure 2). A UPIN structure G is decomposable if G is triangulated. For the special 
case of decomposable models, G can be converted to a junction tree , which is a tree of cliques 
of G arranged such that the cliques satisfy the running intersection property , namely, that each 
node in G which appears in any two different cliques also appears in all cliques on the path 
between these two cliques. Associated with each edge in the junction tree is a separator S, 
such that S contains the variables in the intersection of the two cliques which it links. Given 
a junction tree representation, one can factorize p(U) as the product of clique marginals over 
separator marginals (Pearl 1988): 


p(u) 


YlceVcP^c) 

risei/sP^s) 


(2) 


where p{xc) and p(xs) are the marginal (joint) distributions for the variables in clique C and 
separator S respectively and Vc and Vs are the set of cliques and separators in the junction 
tree. 

This product representation is central to the results in the rest of the paper. It is the basis 
of the fact that globally consistent probability calculations on U can be carried out in a purely 
local manner. The mechanics of these local calculations will be described later in the paper. 
At this point it is sufficient to note that the complexity of the local inference algorithms scales 
as the sum of the sizes of the state-spaces of the cliques. Thus, local clique updating can make 
probability calculations on U much more tractable than using “brute force” inference, if the 
model decomposes into relatively small cliques. 

Many probability models of interest may not be decomposable. However, we can define a 
decomposable cover G' for p such that G' is a triangulated, but not necessarily minimal, UPIN 
structure for p. Since any UPIN G can be triangulated simply by addition of the appropriate 
edges, one can always identify at least one decomposable cover G'. However, a decomposable 
cover may not be minimal in that it can contain edges which obscure certain independencies 
in the model p: for example, the complete graph is a decomposable cover for all possible 
probability models p. For efficient inference, the goal is to find a decomposable cover G' such 
that G' contains as few extra edges as possible over the original UPIN structure G. Later we 
discuss a specific algorithm for finding decomposable covers for arbitrary PIN structures. All 
singly-connected UPIN structures imply probability models Vg which are decomposable. 

Note that, given a particular probability model p and a UPIN G for p, the process of adding 
extra edges to G to create a decomposable cover does not change the underlying probability 
model p, i.e., the added edges are a convenience for manipulating the graphical representation, 
but the underlying numerical probability specifications remain unchanged. 

An important point is that decomposable covers have the running intersection property and 
thus can be factored as in Equation 2: thus local clique updating is also possible with non- 
decomposable models via this conversion. Once again, the complexity of such local inference 
scales with the sum of the size of state-spaces of the cliques in the decomposable cover. 

In summary, any UPIN structure can be converted to a junction tree permitting inference 
calculations to be carried out purely locally on cliques. 
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(a) 


(b) 


Figure 3: (a) A DPIN structure G D which captures a set of independence relationships among 
the set {A'i,... ,A's}. For example, A '4 T AdjAA- (b) The moral graph G M for G D , where the 
parents of A'| have been linked. 

3.2 Directed Probabilistic Independence Networks (DPINs) 

A DPIN is composed of both a DPIN structure and DPIN parameters. A DPIN structure 
specifies a set of conditional independence relations for a probability model in the form of a 
directed graph. DPIN parame ters consist of numerical specifications of a particular probability 
model consistent with the DPIN structure. DPINs are referred to in the literature using differ¬ 
ent names, including Bayes network, belief network, recursive graphical model, causal (belief) 
network, and probabilistic (causal) network. 

3.2.1 Conditional Independence Semantics of DPIN Structures 

A DPIN structure is an ADG G D = (V, E ) where there is a one-to-one correspondence between 
V and the elements of the set of random variables U = {A'i,..., Ahv}. 

The moral graph G M of G D is defined as the undirected graph obtained from G D by placing 
undirected edges between all non-adjacent parents of each node and then dropping the directions 
from the remaining directed edges (see Figure 3b for an example). The term “moral” was coined 
to denote the “marrying” of “unmarried” (nonadjacent) parents. 

Let A , B , and S be any disjoint subsets of nodes in G D . G D is a DPIN structure for p(U) 
if for any A , B , and S such that S separates A and B in G D , the conditional independence 
relation A T B\S holds in p(U). This is the same definition as for a UPIN structure except 
that separation has a different interpretation in the directed context: S separates A from B 
in a directed graph if S separates A from B in the moral (undirected) graph of the smallest 
ancestral set containing A, B , and S (Lauritzen et al. 1990). It can be shown that this is 
equivalent to the statement that a variable A'; is independent of all other nodes in the graph 
except for its descendants, given the values of its parents. Thus, as with a UPIN structure, the 
DPIN structure implies certain conditional independence relations, which in turn imply a set 
of probability models p £ Vqd . Figure 3a contains a simple example of a DPIN structure. 
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Figure 4: (a) The DPIN structure to encode the fact that A '3 depends on Afl and A '2 but Afl T 
A' 2 . For example, consider that A'i and A '2 are two independent coin flips and that A '3 is a bell 
which rings when the flips are the same. There is no perfect UPIN structure which can encode 
these dependence relationships, (b) A UPIN structure which encodes Afl T A' 4 |{A' 2 , A' 3 } and 
A '2 T A' 3 |{A'i, A' 4 }. There is no perfect DPIN structure which can encode these dependencies. 

3.2.2 Probability Functions on DPINs 

A basic property of a DPIN structure is that it implies a direct factorization of the joint 
probability distribution p(U): 

N 

p( u) = (3) 

8 = 1 

where pa(xi) denotes a value assignment for the parents of A';. A probability model p can 
be written in this factored form in a trivial manner by the conditioning rule. Consequently 
there are many possible DPIN structures consistent with a particular probability model p, 
potentially containing extra edges which hide true conditional independence relations. Thus, 
one can define minimal DPIN structures for p in a manner exactly equivalent to that of UPIN 
structures: deletion of an edge in a minimal DPIN structure G D implies an independence 
relation which does not hold in p £ Vqd . Similarly, G D is a perfect DPIN structure G for p if 
G d is a DPIN structure for p and all the conditional independence relations present in p are 
represented by separation in G D . As with UPIN structures, minimal does not imply perfect 
for DPIN structures. For example, consider the independence relations Afl T Ahi|{A~2, A'3} and 
AT 2 T A~ 3 |{A~i, A'4}: the minimal DPIN structure contains an edge from A'3 to A'2 (see Figure 
4(b)). 

3.3 Differences between Directed and Undirected Graphical Representa¬ 
tions 

It is an important point that directed and undirected graphs possess different conditional inde¬ 
pendence semantics. There are common conditional independence relations which have perfect 
DPIN structures but 110 perfect UPIN structures and vice-versa (see Figure 4 for examples). 

Does a DPIN structure have the same Markov properties as the UPIN structure obtained by 
dropping all the directions 011 the edges in the DPIN structure? The answer is yes if and only if 
the DPIN structure contains 110 subgraphs where a node has two or more non-adjacent parents 
(Whittaker 1990; Pearl et al. 1990). I 11 general, it can be shown that if a UPIN structure G 
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for p is decomposable (triangulated) then it has the same Markov properties as some DPIN 
structure for p. 

On a more practical level, DPIN structures are frequently used to encode causal information, 
i.e., to formally represent the belief that X{ preceeds Xj in some causal sense, e.g., temporally. 
DPINs have found application in causal modelling in applied statistics and artificial intelligence. 
Their popularity in these fields stems from the fact that the joint probability model can be 
specified directly via Equation 3, i.e., via the specification of conditional probability tables or 
functions (Spiegelhalter et al. 1991). In contrast, UPINs must be specified in terms of clique 
functions (as in Equation 1) which may not be as easy to work with (cf. Geman and Geman 
(1984), Modestino and Zhang (1992) and Vandermeulen et al. (1994) for examples of ad hoc 
design of clique functions in image analysis). UPINs are more frequently used in problems 
such as image analysis and statistical physics where associations are thought to be correlational 
rather than causal. 

3.4 From DPINs to (Decomposable) UPINs 

The moral UPIN structure G M (obtained from the DPIN structure G D ) does not imply any 
new independence relations which are not present in G D . As with triangulation, however, 
the additional edges may obscure conditional independence relations which are implicit in the 
numeric specification of the original probability model p associated with the DPIN structure 
G D . Furthermore, G M may not be triangulated (decomposable). By the addition of appropriate 
edges, the moral graph can be converted to a (non-unique) triangulated graph G', namely a 
decomposable cover for G M . In this manner, for any probability model p for which G D is a 
DPIN structure, one can construct a decomposable cover G' for p. 

This mapping from DPIN structures to UPIN structures was first discussed in the context 
of efficient inference algorithms by Lauritzen and Spiegelhalter (1988). The advantage of this 
mapping derives from the fact that analysis and manipulation of the resulting UPIN is consid¬ 
erably more direct than dealing with the original DPIN. Furthermore, it has been shown that 
many of the inference algorithms for DPINs are in fact special cases of inference algorithms for 
UPINs and can be considerably less efficient (Shachter et al. 1994). 


4 Modeling HMMs as PINs 

4.1 PINs for HMMs 

In hidden Markov modeling problems (Poritz 1988; Rabiner 1989) we are interested in the set 
of random variables U = {Hi, Or, H^, O2, ■ ■ ■, H^-i, Ojv-i, Hn, Ojv}, where Hi is a discrete¬ 
valued hidden variable at index i, and Oi is the corresponding discrete-valued observed variable 
at index i, 1 < i < N (the results here can be directly extended to continuous-valued observ¬ 
ables). The index i denotes a sequence from 1 to N, for example, discrete time steps. Note 
that Oi is considered univariate for convenience: the extension to the multivariate case with d 
observables is straightforward but is omitted here for simplicity since it does not illuminate the 
conditional independence relationships in the HMM. 

The well-known simple first-order HMM obeys the following two conditional independence 
relations: 

Hi T {Hi, Or,..., Hi- 2 , Oi- 2 , Oi-i}\Hi-i, 2 <i<N (4) 
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Figure 5: (a) The PIN structure for HMM(1,1) (b) A corresponding junction tree. 


O] °2 

2 2 


°N-1 °N 

2 2 


O] o 2 0 N _] o N 

0 0 0 0 


0 -K3- -KJ-►O 


t t 

O-K> 


t t 

♦O-►O 


H] H 2 


h N- 1 h N 


H, H 2 


h N- 1 h N 


(a) 


(b) 


Figure 6: DPIN structures for HMM(1,1): (a) the DPIN structure for the HMM(1,1) probability 
model, (b) a DPIN structure which is not a DPIN structure for the HMM( 1,1) probability model. 


and 

Oi 1 {H u Or,..., Hi-uOi-^Hi, 2 <i<N (5) 

We will refer to this “first-order” hidden Markov probability model as HMM(1,1): the notation 
HMM(A’,./) is defined such that the model has state memory of depth K and contains J 
separate underlying state processes. The notation will be clearer in later sections when we 
discuss specific examples with A’,-/ > 1. 

Construction of a PIN for HMM(1,1) is particularly simple. In the undirected 
case, assumption 1 requires that each state Hi is only connected to Aj-i from the set 
{Hi, • •, Hi- 2 , Oi- 2 , Oi-i}. Assumption 2 requires that Oi is only connected to Hi. The 
resulting UPIN structure for HMM(1,1) is shown in Figure 5a. This graph is singly-connected 
and thus implies a decomposable probability model p for HMM(1,1), where the cliques are of 
the form {A;, Oi} and {A;_i, Hi} (Figure 5b). In Section 5 we will see how the joint probability 
function can be expressed as a product function on the junction tree, thus leading to a junction 
tree definition of the familiar F-B and Viterbi inference algorithms. 

For the directed case the connectivity for the DPIN structure is the same. It is natural to 
choose the directions on the edges between A;_i and Hi as going from i — 1 to i (although the 
reverse direction could also be chosen without changing the Markov properties of the graph). 
The directions on the edges between Hi and Oi must be chosen as going from Hi to Oi rather 
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than in the reverse direction (Figure 6a). In reverse (Figure 6b) the arrows would imply that 
Oi is marginally independent of which is not true in the HMM(1,1) probability model. 

The proper direction for the edges implies the correct relation, namely that Oi is conditionally 
independent of Hi -1 given Hi. 

The DPIN structure for HMM(1,1) does not possess a subgraph with non-adjacent parents. 
As stated earlier this implies that the implied independence properties of the DPIN structure are 
the same as those of the corresponding UPIN structure obtained by dropping the directions from 
the edges in the DPIN structure, and thus they both result in the same junction tree structure 
(Figure 5b). Thus, for the HMM(1,1) probability model, the minimal directed and undirected 
graphs possess the same Markov properties, i.e., imply the same conditional independence 
relations. Furthermore, both PIN structures are perfect maps for the directed and undirected 
cases respectively. 

4.2 Inference and MAP Problems in HMMs 

In the context of HMMs, the most common inference problem is the calculation of the likelihood 
of the observed evidence given the model, i.e., p(cq,..., ojv|model), where the oi,..., ojv denote 
observed values for 0\,... ,0n- (In this section we will assume that we are dealing with 
one particular model where the structure and parameters have already been determined and, 
thus, we will not explicitly indicate conditioning on the model). The “brute force” method for 
obtaining this probability would be to sum out the unobserved state variables from the full 
joint probability distribution: 

p(oi, ■ ■ -,o N ) = P(Hi, o\, ..H n , o n ) (6) 

hi 

where hi denotes the possible values of hidden variable Hi. 

Another inference calculation of interest is the calculation ofp(/qjcq,..., ojv), for any or all h 
namely, the probability of a particular hidden state value given the observed evidence. Inferring 
the posterior state probabilities is useful when the states have direct physical interpretations 
(as in fault monitoring applictions (Smyth 1994)) and is also implicitly required during the 
standard Baum-Welch learning algorithm for HMM(1,1). 

In general, both of these computations scale as m N where m is the number of states for 
each hidden variable. In practice, the F-B algorithm (Poritz 1988; Rabiner 1989) can perform 
these inference calculations with much lower complexity, namely Nm 2 . The likelihood of the 
observed evidence can be obtained with the forward step of the F-B algorithm: calculation of 
the state posterior probabilities requires both forward and backward steps. The F-B algorithm 
relies on a factorization of the joint probability function to obtain locally recursive methods One 
of the key points in this paper is that the graphical modeling approach provides an automatic 
method for determining such local efficient factorizations, for an arbitrary probabilistic model, 
if efficient factorizations exist given the Cl relations specified in the model. 

The MAP identification problem in the context of HMMs involves identifying the most likely 
hidden state sequence given the observed evidence. Just as with the inference problem, the 
Viterbi algorithm provides an efficient, locally recursive method for solving this problem with 
complexity fra 2 , and again, as with the inference problem, the graphical modeling approach 
provides an automatic technique for determining efficient solutions to the MAP problem for 
arbitrary models, if an efficient solution is possible given the structure of the model. 
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5 Inference and MAP Algorithms for DPINs 


Inference and MAP algorithms for DPINs and UPINS are quite similar: the UPIN case in¬ 
volves some subtleties not encountered in DPINs and so discussion of UPIN inference and 
MAP algorithms is deferred until Section 7. The inference algorithm for DPINs (developed 
by Jensen, Lauritzen and Oleson (1990) and hereafter referred to as the JLO algorithm) is 
a descendant of an inference algorithm first described by Lauritzen and Spiegelhalter (1988). 
The JLO algorithm applies to discrete-valued variables: extensions to the JLO algorithm for 
Gaussian and Gaussian-mixture distributions are discussed in Lauritzen and Wermuth (1989). 
A closely related algorithm to the JLO algorithm, developed by Dawid (1992), solves the MAP 
identification problem with the same time-complexity as the JLO inference algorithm. 

We show that the JLO and Dawid algorithms are strict generalizations of the well-known 
F-B and Viterbi algorithms for HMM(1,1), in that they can be applied to arbitrarily complex 
graph structures (and thus a large family of probabilistic models beyond HMM(1,1)) and handle 
missing values, partial inference, and so forth in a straightforward manner. 

There are many variations on the basic JLO and Dawid algorithms. For example, Pearl 
(1988) describes related versions of these algorithms in his early work . However, it can be 
shown (Shachter et al. 1994) that all known exact algorithms for inference on DPINs are 
equivalent at some level to the JLO and Dawid algorithms. Thus, it is sufficient to consider 
the JLO and Dawid algorithms in our discussion as they subsume other graphical inference 
algorithms. 

The JLO and Dawid algorithms operate as a two-step process: 

1. The construction step: this involve a series of sub-steps where the original directed graph 
is moralized and triangulated, a junction tree is formed, and the junction tree is initialized. 

2. The propagation step: the junction tree is used in a local message-passing manner to 
propagate the effects of observed evidence, i.e., to solve the inference and MAP problems. 

The first step is carried out only once for a given graph. The second (propagation) step is 
carried out each time a new inference for the given graph is requested. 

5.1 The Construction Step of the JLO Algorithm: From DPIN structures 
to Junction Trees 

We illustrate the construction step of the JLO algorithm using the simple DPIN structure, 
G d , over discrete variables U = {X\,..., Xq} shown in Figure 7a. The JLO algorithm first 
constructs the moral graph G M (Figure 7b). It then triangulates the moral graph G M to obtain 
a decomposable cover G' (Figure 7c). The algorithm operates in a simple greedy manner based 
on the fact that a graph is triangulated if and only if all of its nodes can be eliminated, where 
a node can be eliminated whenever all of its neighbors are pairwise linked. Whenever a node is 
eliminated, it and its neighbors define a clique in the junction tree that is eventually constructed. 
Thus, we can triangulate a graph and generate the cliques for the junction tree by eliminating 
nodes in some order, adding links if necessary. If no node can be eliminated without adding 
links, then we choose the node that can be eliminated by adding the links that yield the clique 
with the smallest state-space (Jensen 1995). 

After triangulation the JLO algorithm constructs a junction tree from G 1 , i.e., a clique tree 
satisfying the running intersection property. The junction tree construction is based on the 
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Figure 7: (a) A simple DPIN structure G D . (b) The corresponding (undirected) moral graph 
G M . (c) The corresponding triangulated graph G'. (d) The corresponding junction tree. 


following fact. Define the weight of a link between two cliques as the number of variables in 
their intersection. Then, a tree of cliques will satisfy the running intersection property if and 
only if it is a spanning tree of maximal weight. Thus, the JLO algorithm constructs a junction 
tree by choosing successively a link of maximal weight unless it creates a cycle. The junction 
tree constructed from the cliques defined by the DPIN structure triangulation in Figure 7c is 
shown in Figure 7d. 

The worst-case complexity is 0(N 3 ) for the triangulation heuristic and 0(N 2 log N) for the 
maximal spanning tree portion of the algorithm. This construction step is carried out only once 
as an initial step to convert the original graph to a junction tree representation. 

5.2 Initializing the Potential Functions in the Junction Tree 

The next step is to take the numeric probability specifications as defined on the directed graph 
G d (Equation 3) and convert this information into the general form for a junction tree repre¬ 
sentation of p (Equation 2). This is achieved by noting that each variable A'; is contained in at 
least one clique in the junction tree. Assign each A'; to just one such clique and for each clique 
define the potential function ac(C) to be either the product of p(Xi\pa(Xi)) or 1 if no variables 
are assigned to that clique. Define the separator potentials (in Equation 2) to be 1 initially. 

In the section which follows we describe the general JLO algorithm for propagating messages 
through the junction tree to achieve globally consistent probability calculations. At this point it 
is sufficient to know that a schedule of local message passing can be defined which converges to 
a globally consistent marginal representation for p, i.e., the potential on any clique or separator 
is the marginal for that clique or separator (the joint probability function). Thus, via local 
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message-passing, one can go from the initial potential representation defined above to a marginal 
representation: 


T\cev c P( x c) 


Usev s P( x s) 


(7) 


At this point the junction tree is initialized. This operation in itself is not that useful, of more 
interest is the ability to propagate information through the graph given some observed data 
and the initialized junction tree, e.g., to calculate the posterior distributions of some variables 
of interest. 

From this point onwards we will implicitly assume that the junction tree has been initialized 
as described above so that the potential functions are the local marginals. 


5.3 Local Message Propagation in Junction Trees Using The JLO Algorithm 

In general p(U) can be expressed as 

ricGVg a c( x C ) 

risery bs(xs) 

where the ac and 65 are non-negative potential functions (the potential functions could be 
the initial marginals described above for example). K = ({ac : C G Vc},{bs ■ S G 5c}) is a 
representation for p(U). A factorizable function p(U) can admit many different representations, 
i.e., many different sets of clique and separator functions which satisfy Equation 8 given a 
particular p(U). 

As mentioned above, the JLO algorithm carries out globally consistent probability calcu¬ 
lations via local message-passing on the junction tree, i.e., probability information is passed 
between neighboring cliques and clique and separator potentials are updated based on this lo¬ 
cal information. A key point is that the cliques and separators are updated in a fashion which 
ensures that at all times K is a representation for p(U), i.e., Equation 8 holds at all times. 
Eventually the propagation converges to the marginal representation given the initial model 
and the observed evidence. 

The message-passing proceeds as follows. We can define a flow from clique C) to Cj in the 
following manner where and Cj are two cliques which are adjacent in the junction tree. Let 
Sk be the separator for these two cliques. Define 


( 8 ) 



b *s k ( x s k ) = a Ci(x Ci ) 

Ci\S k 

(9) 

and 


a c 3 ( x c/) = a Cj (x Cj )\s k (xs k ) 

( 10 ) 

where 


Sk( Sk) b Sk (x Sk y 

( 11 ) 


A s k ( x s k ) is the update factor. Passage of a flow corresponds to updating the neighboring clique 
with the probability information contained in the originating clique. This flow induces a new 
representation K* = ({af : C G Vc}, {65 : S G 5c}) for p(U). 

A schedule of such flows can be defined such that all cliques are eventually updated with 
all relevant information and the junction tree reaches an equilibrium state. The most direct 


13 



scheduling scheme is a two-phase operation where one node is denoted the root of the junction 
tree. The collection phase involves passing flows along all edges towards the root-clique (if a 
node is scheduled to have more than one incoming flow, the flows are absorbed sequentially). 
Once collection is complete, the distribution phase involves passing flows out from this root in 
the reverse direction along the same edges. There are at most two flows along any edge in the 
tree in a non-redundant schedule. Note that the directionality of the flows in the junction tree 
need have nothing to do with any directed edges in the original DPIN structure. 

5.4 The JLO Algorithm for Inference given Observed Evidence 

The particular case of calculating the effect of observed evidence (inference) is handled in the 
following manner. Consider that we observe evidence of the form e = {X; = x*,Xj = x *,...} 
and U e = {Xi,Xj, .. .} denotes the set of variables which have been observed. Let = U\U e 
denote the set of hidden or unobserved variables and a value assignment for XJ h . 

Consider the calculation ofpCU^e). Define an evidence function g e {xi) such that 


Let 


■ (r .) = J 1 if **■ = x i 

1 (0 otherwise. 

(12) 

/*( u ) = pHII/Od) 

(13) 

IT 



Thus, we have that /*(u ) oc ^(u^le). To obtain /*(u) by operations on the junction tree one 
proceeds as follows. First assign each observed variable X{ £ U e to one particular clique which 
contains it (this is termed “entering the evidence into the clique”). Let C E denote the set of 
all cliques into which evidence is entered in this manner. For each C £ C E let 


gc(xc) = II 9 e (x t ) (14) 

{i-.Xi is entered into C} 


Thus, 

f*(u) = p(u)x n gc(x c '). (15) 

cec E 

One can now propagate the effects of these modifications throughout the tree using the collect 
and distribute schedule described in 5.3. Let Xq denote a value assignment of the hidden 
(unobserved) variables in clique C. When the schedule of flows is complete one gets a new 
representation Kj such that the local potential on each clique is f*{xc ) = p(s(f,e), i.e., the 
joint probability of the local unobserved clique variables and the observed evidence (Jensen et 
al. 1990) (similarly for the separator potential functions). If one marginalizes at the clique over 
the unobserved local clique variables, 


Y,P( x cX) = p(e), (16) 

x h c 

one gets the probability of the observed evidence directly. Similarly, if one normalizes the 
potential function at a clique to sum to 1, one obtains the conditional probability of the local 
unobserved clique variables given the evidence, p(a;(t|e). 
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5.5 Complexity of the Propagation Step of the JLO Algorithm 

In general, the time complexity T of propagation within a junction tree is s(C))) where 

Nc is the number of cliques in the junction tree and s(Ci) is the number of states in the joint 
state-space of C'i (equal to the product over each variable in C'i of the number of states of 
each variable). Thus, for inference to be efficient, we need to construct junction trees with 
small clique sizes. Problems of finding optimally small junction trees (e.g., finding the junction 
tree with the smallest maximal clique) are NP-hard. Nonetheless, the heuristic algorithm for 
triangulation described earlier has been found to work well in practice (Jensen et al. 1990; 
Jensen 1995). 

6 Inference and MAP Calculations in HMM(1,1) 

6.1 The F-B Algorithm for HMM(1,1) is a Special Case of the JLO Algo¬ 
rithm 

Figure 5b shows the junction tree for HMM(1,1). In this section we apply the JLO algorithm to 
the HMM(1,1) junction tree structure to obtain a particular inference algorithm for HMM(1,1). 
As mentioned earlier, the HMM(1,1) inference problem consists of being given a set of values 
for the observable variables, 


e = {Oi = oi, 0 2 = 02 , • • •, On = <Lv} (17) 

and inferring the likelihood of e given the model. As described in the previous section this 
problem can be solved exactly by local propagation in any junction tree using the JLO inference 
algorithm. 

Let the final clique in the chain containing (H^-i, -Hjv) be the root clique. Thus, a non- 
redundant schedule consists of first recursively passing flows from each ( Oi , Hi) and (Ff 4 -_ 2 , -ffj-i) 
to each (i/j_ 1 , Hi) in the appropriate sequence (the “collect” phase), and then distributing flows 
out in the reverse direction from the root clique. If we are only interested in calculating the 
likelihood of e given the model, then the distribute phase is not necessary since we can simply 
marginalize over the local variables in the root clique to obtain p(e). 

A comment on notation: subscripts on potential functions and update factors indicate which 
variables have been used in deriving that potential or update factor, e.g., fo 1 indicates that this 
potential has been updated based on information about 0\ but not using information about 
any other variables. 

Assume that the junction tree has been initialized so that the potential function for each 
clique and separator is the local marginal. Given the observed evidence e, each individual piece 
of evidence O = o* is entered into its clique (Oi, Hi) such that each clique marginal becomes 
fQ^(hi,Oi) = p(hi,o*) after entering the evidence (as in Equation 14). 

Consider the portion of the junction tree in Figure 8, and in particular the flow between 
( Oi,Hi) and (i/j_i, Hi). By definition the potential on the separator Hi is updated to 

f^(hi) = J2f*( h ^°i)=P( h ^ 0 *) ( 18 ) 

The update factor from this separator flowing into clique is then 

x ° Ahi> = =p(a - ]hi> - < i9) 
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Figure 8: Local message passing in the HMM(1,1) junction tree during the collect phase of a 
“left to right” schedule. Ovals indicate cliques, boxes indicate separators, and arrows indicate 
flows. 


This update factor is “absorbed” into Hi) as follows: 

fo, ( h i- 1 > hi) = p( h;-!, hi )X 0 , ( hi) = p( i, h t )p( o*\h t ) ( 20 ) 

Now consider the flow from clique (Hi- 2 , Si-i) to clique (Hi-\,Hi). Let <l>, ; = {0;,.. -,Oj} 
denote a set of consecutive observable variables and <f>* • = {o*,..., o*} denote a set of observed 
values for these variables, 1 < i < j < N. Assume that the potential on the separator i?j_i has 
been updated to 

(hi- 1 )=p*(h i _ 1 ^l i -i) ( 21 ) 

via earlier flows in the schedule. Thus, the update factor on separator Hi-i becomes 




P*(hi-i,<t>t,i-i) 


P(hi- r) 

and this gets absorbed into clique (Hi-i, Hi) to produce 


( hi- 1 , hi ) — fo t ( hi- 1 , hi ) t-1 ( hi—i ) 

p^hi-^^ _,) 

= p (/?;_!, hi )p(o t \h , t )- ^ - 

= Pi o* | h, t )p( h, t | hi-x )p* ( hi- 1 , ^ )i _ 1 ). 


Finally, we can calculate the new potential on the separator for the flow from clique Hi) 

fo (Hi, 

/..., '• h ■ = ( 24 ) 

hi-1 

= P(o*\hi) ( 25 ) 

h,.~ l 

= p(o*|/? ; ) ^ p()/<!! ,_! () ( 26 ) 
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K + 7- 


Figure 9: Local message passing in the HMM(1,1) junction tree during the collect phase of a 
“right to left” schedule. Ovals indicate cliques, boxes indicate separators, and arrows indicate 
flows. 

Proceeding recursively in this manner one finally obtains at the root clique 

fq, 1N {hN-i,h N ) = ( 27 ) 

from which one can get the likelihood of the evidence, 

p(e)=p(<f>l iN )= J2 fk :S (hN ~'’ hN )• ( 28 ) 

^iV-l T N 

We note that Equation 26 directly corresponds to the recursive equation (Equation 20 in 
Rabiner (1989)) for the a variables used in the forward phase of the F-B algorithm, the standard 
HMM(1,1) inference algorithm. In particular, using a “left-to-right” schedule the updated 
potential functions on the separators between the hidden cliques^ the (hi) functions, are 
exactly the a variables. Thus, when applied to HMM(1,1), the JLO algorithm produces exactly 
the same local recursive calculations as the forward phase of the F-B algorithm. 

One can also show an equivalence between the backward phase of the F-B algorithm and the 
JLO inference algorithm. Let the “left-most” clique in the chain, (Hi.H^), be the root clique 
and define a schedule such that the flows go from right to left. Figure 9 shows a local portion 
of the clique tree and the associated flows. Consider that the potential on clique ( Hi , i/j+i) has 
been updated already by earlier flows from the right. Thus, by definition, 

fi, +ltN (hi,hi+i) = p(hi, /?; + i, (/)* +1:N ). (29) 

The potential on the separator between (FT;, i/j+i) and (Hi-\, Hi) is calculated as: 

= ^P( / T, / T+l^i+i.jv) (30) 

C+i 

= P( hi ) P( /? *+i I h i )P( °i+i l /? *+i )P( <^+ 2 ,ivl / T+i) (31) 

hi +1 

(by virtue of the various conditional independence relations in HMM(1,1)) 
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(32) 


= p{hi) P( h i+l\ h i)P( 0 Ul\ h i+l 


,P( c t ) i+2,N^ hi+1, 


^ + 1 


p(h. 


i+l. 


= p( h i) Y, p(^I^+i)pK+iI^+i) / c&,+2 ;/ ,+1 ^ 

fc.,, Pl'M 


U + l 


i+l. 


Defining the update factor on this separator yields 

L, w (^*') 




^ + 2,iV v 

p(hi 


= ^2 P( h t\ h t+i)p(o* +1 \ h % +i) 

G + l 


(33) 



(34) 

^<b+2 ,v^*+ 1 ) 

(35) 

^+i)A$ 8+2iJV (/b+i). 

(36) 


U + l 


This set of recursive equations in A corresponds exactly to the recursive equation (Equation 25 
in Rabiner (1989)) for the (3 variables in the backward phase of the F-B algorithm. In fact, 
the update factors A on the separators are exactly the (3 variables. Thus, we have shown that 
the JLO inference algorithm recreates the F-B algorithm for the special case of the HMM(1,1) 
probability model. 


6.2 Equivalence of Dawid’s Propagation Algorithm for Identifying MAP As¬ 
signments and the Viterbi Algorithm 

Consider that one wishes to calculate /(u* 1 , e) = vaax Xlt ^^ XK p(xi ,..., xk, e) and one also wishes 
to identify a set of values of the unobserved variables which achieve this maximum, where K 
is the number of unobserved (hidden) variables. This calculation can be acheived using a local 
propagation algorithm on the junction tree if one makes two modifications to the standard JLO 
inference algorithm described above. This algorithm is due to Dawid (1992) and as pointed out 
earlier this is the most general algorithm from a set of related methods. 

Firstly, during a how, the marginalization of the separator is replaced by: 

bs(xs) = ma xa c {x c ) (37) 

c\s 

where C is the originating clique for the how. The definition for A s( x s) is also changed in the 
obvious manner. 

Secondly, marginalization within a clique is replaced by maximization: 

fc = maxp(u). (38) 

uVc 


Given these two changes it can be shown that if the same propagation operations are carried 
out as described earlier, the resulting representation Kf at equilibrium is such that the potential 
function on each clique C is 


f(x c )= max p(xq, e, {u^ \ x c }) 
U h \x c 


(39) 
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where Xq denotes a value assignment of the hidden (unobserved) variables in clique C. Thus, 
once the Kf representation is obtained, one can locally identify the values of Xq which maximize 
the full joint probability as 

£ c = arg^ f(x c ). (40) 

In the probabilistic expert systems literature this procedure is known as generating the “most 
probable explanation” (MPE) given the observed evidence. 

The HMM(1,1) MAP problem consists of being given a set of values for the observable 
variables, e = {0\ = cq, O 2 = cq,..., Ojv = °jv} and inferring 

max p(h\,... ,h,N,e). (41) 

hi,...,h N 

or the set of arguments which acheive this maximum. Since Dawid’s algorithm is applicable to 
any junction tree it can directly be applied to the HMM(1,1) junction tree in Figure 5b. In the 
Appendix it is shown that Dawid’s algorithm, when applied to HMM(1,1), is exactly equivalent 
to the standard Viterbi algorithm. 

6.3 Discussion of the Equivalences between the HMM and JLO Algorithms 

As shown above, when HMM(1,1) is modeled as a PIN, the JLO local propagation algorithms 
(henceforth referred to as “the graphical algorithms”) for this PIN are equivalent to the well- 
known F-B and Viterbi algorithms. In itself, this equivalence is not surprising since both pairs 
of algorithms are solving exactly the same problem via local recursive updating. For example, 
Dawid’s method and the Viterbi algorithm are both direct applications of dynamic programming 
to the MAP problem. 

What is interesting about this equivalence result is that the graphical algorithms are more 
general than the F-B and Viterbi algorithms: 

1. While special purpose extensions to the standard Viterbi and F-B algorithms can be 
derived to handle various extensions to HMM(1,1) (Tao 1992), the JLO algorithms provide 
by definition a completely general exact inference method for any PIN. 

2. The graphical algorithms can easily handle other inference tasks besides just calculating 
the likelihood of the evidence or the MAP solution. For example, missing or probabilistic 
evidence, simulating values from the model, calculating partial solutions, are all easy 
to specify in terms of the graphical algorithms. These problems in principle could also 
be handled by appropriate modifications to the standard F-B and Viterbi algorithms: 
the point is that the graphical algorithms provide the natural and direct framework for 
identifying such solutions. 

Note that the obvious structural equivalence between PIN structures and HMMs has been 
noted before by Buntine (1994), Frasconi and Bengio (1994), and Lucke (1995) among others: 
however, the demonstration of equivalence of specific inference algorithms is new as far as we 
are aware. 

Using the graphical algorithms on HMM(1,1), when evidence is entered into the observable 
states and assuming m discrete states per hidden variable, the computational complexity of 
solving the inference and MAP problems is 0(Nm 2 ). Naturally, given that they are equivalent 
for HMM(1,1), this is the same complexity as the standard F-B and Viterbi algorithms. 
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7 Inference and MAP Algorithms for UPINs 


In Section 5 we described the JLO algorithm for local inference given a DPIN: for UPINs the 
procedure is very similar except for two changes to the overall algorithm. The first is the trivial 
observation that the moralization step is not necessary. The second difference, initialization of 
the junction tree is less trivial. In Section 5.2 we described how to go from a specification of 
conditional probabilities in a directed graph to an initial potential function representation on 
the cliques in the junction tree. To utilize undirected links in the model specification process 
requires new machinery to perform the initialization step. In particular we wish to compile the 
model into the standard form of a product of potentials on the cliques of a triangulated graph 
(cf. Equation 1): 

P( u ) = n a c( x c)- 
cev c 

Once this initialization step has been achieved, the JLO propagation procedure proceeds as 
before. 

Consider the chordless cycle shown in Figure 4b. Suppose that we parameterize the proba¬ 
bility distribution on this graph by specifying pairwise marginals (or pairwise potentials) on the 
four pairs of neighboring nodes. We wish to convert such a local specification into a globally 
consistent joint probability distribution, i.e., a marginal representation. An algorithm known 
as Iterative Proportional Fitting (IPF) is available to perform this conversion. Classically, IPF 
proceeds as follows (Bishop, Fienberg, & Holland, 1973). Suppose for simplicity that ah of the 
random variables are discrete (a Gaussian version of IPF is also available (Whittaker 1990)) 
such that the joint distribution can be represented as a table. The table is initialized with equal 
values in all of the cells. For each marginal in turn, the table is then rescaled by multiplying 
every cell by the ratio of the desired marginal to the corresponding marginal in the current 
table. The algorithm visits each marginal in turn, iterating over the set of marginals. If the 
set of marginals are consistent with a single joint distribution, the algorithm is guaranteed to 
converge to the joint distribution. Once the joint is available, the potentials in Equation 1 can 
be obtained (in principle) by marginalization. 

Although IPF solves the initialization problem in principle, it is inefficient. Jirousek and 
Preucil (1995) developed an efficient version of IPF that both avoids the need for storing the 
joint distribution as a table and avoids the need for explicit marginalization of the joint to 
obtain the clique potentials. Jirousek’s version of IPF represents the evolving joint distribution 
directly in terms of junction tree potentials. The algorithm proceeds as follows. Let I be a set 
of subsets of V. For each I £ I, let q(xi) denote the desired marginal on the subset I. Let 
the joint distribution be represented as a product over junction tree potentials (Equation 1), 
where each ac is initialized to an arbitrary constant. Visit each I £ I in turn, updating the 
corresponding clique potential ac (i.e, that potential ac for which I C C) as follows: 

a C'( x c) = a c< x c)^0y 

The marginal p(xi) is obtained via the JLO algorithm, using the current set of clique potentials. 
Intelligent choices can be made for the order in which to visit the marginals to minimize the 
amount of propagation needed to compute p(xi). This algorithm is simply an efficient way of 
organizing the IPF calculations and inherits the latter’s guarantees of convergence. 

For mixed (or chain) graphs, the clique potentials are initialized to constant values and are 
multiplied by the appropriate conditional probability distributions associated with the directed 
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Figure 10: (a) the IIPIN structure for the HMM(1,2) model with r = 2, (b) a triangulation of 
this IIPIN structure. 

links (if any). The marginals associated with the undirected links are then incorporated into 
the clique potentials by running IPF. 


8 More Complex HMMs for Speech Modeling 

Although hidden Markov models have provided an exceedingly useful framework for the mod¬ 
eling of speech signals, it is also true that the simple HMM(1,1) model underlying the standard 
framework has strong limitations as a model of speech. Real speech is generated by a set of 
coupled dynamical systems (lips, tongue, glottis, lungs, air columns, etc.), each of which obeys 
particular dynamical laws. This coupled physical process is not well modeled by the unstruc¬ 
tured state transition matrix of HMM(1,1). Moreover, the first-order Markov properties of 
HMM(1,1) are not well suited to modeling the ubiquitous coarticulation effects that occur in 
speech, particularly coarticulatory effects that extend across several phonemes (cf. Kent & 
Minihe, 1977). A variety of techniques have been developed to surmount these basic weak¬ 
nesses of the HMM(1,1) model, including mixture modeling of emission probabilities, triplione 
modeling, and discriminative training. All of these methods, however, leave intact the basic 
probabilistic structure of HMM(1,1) as expressed by its PIN structure. 

In this section we describe several extensions of HMM(1,1) that assume additional proba¬ 
bilistic structure beyond that assumed by HMM(1,1). PINs provide a key tool in the study of 
these more complex models. The role of PINs is twofold: first, they provide a concise description 
of the probabilistic dependencies assumed by a particular model, and second, they provide a 
general algorithm for computing likelihoods. This second property is particularly important— 
the existence of the JLO algorithm frees us from having to derive particular recursive algorithms 
on a case-by-case basis. 

The first model that we consider can be viewed as a coupling of two HMM(1,1) chains (Saul 
& Jordan, 1995). Such a model can be useful in general sensor fusion problems, for example 
in the fusion of an audio signal with a video signal in lipreading. Because different sensory 
signals generally have different bandwidths, it may be useful to couple separate Markov models 
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that are developed specifically for each of the individual signals. The alternative is to force the 
problem into an HMM(1,1) framework by either oversampling the slower signal, which requires 
additional parameters and leads to a high-variance estimator, or downsampling the faster signal, 
which generally oversmoothes the data and yields a biased estimator. Consider the HMM(1,2) 
structure shown in Figure 10a. This model involves two HMM(1,1) backbones that are coupled 
together via undirected links between the state variables. Let H- 1 ^ and denote the i th 
state and i th output of the “fast” chain, respectively, and let H- 2 ^ and o\ 2 ^ denote the i th state 
and i th output of the “slow” chain. Suppose that the fast chain is sampled r times as often as 
the slow chain. Then is connected to H- 2 ^ for i' equal to r(i — 1) + 1. Given this value for 
fi, the Markov model for the coupled chain implies the following conditional independencies for 
the state variables: 


t n i’ i n i / \ n l ■> n l ? • • • 5 °V-2’ n i-2’ U i-2’ n i-l j, 

(42) 

as well as the following conditional independencies for the output variables: 

{0? } ,o! 2) } -i- o! 1} , H (2 \ 0 (2 \ ..., H& ,(43) 


Additional conditional independencies can be read off the UPIN structure (see Figure 10a). 

As is readily seen in Figure 10a, the HMM(1,2) graph is not triangulated, thus the HMM(1,2) 
probability model is not decomposable. However, the graph can be readily triangulated to form 
a decomposable cover for the HMM(1,2) probability model (see Section 3.1.2). The JLO al¬ 
gorithm provides an efficient algorithm for calculating likelihoods in this graph. This can be 
seen in Figure 10b, where we show a triangulation of the HMM(1,2) graph. The triangulation 
adds 0(Nh ) links to the graph (where Nh is the number of hidden nodes in the graph) and 
creates a junction tree in which each clique is a cluster of three state variables from the under¬ 
lying UPIN structure. Assuming m values for each state variable in each chain, we obtain an 
algorithm whose time complexity is 0(iV^m 3 ). This can be compared to the naive approach of 
transforming the HMM(1,2) model to a Cartesian product HMM(1,1) model, which not only 
has the disadvantage of requiring subsampling or oversampling, but also has a time complexity 
of 0(Nf l Tn 4 ). 

Directed graph semantics can also play an important role in constructing interesting vari¬ 
ations on the hidden Markov model theme. Consider Figure 11a, which shows an HMM(1,2) 
model in which a single output stream is coupled to a pair of underlying state sequences. In 
a speech modeling application such a structure might be used to capture the fact that a given 
acoustic pattern can have multiple underlying articulatory causes. For example, equivalent 
shifts in formant frequencies can be caused by lip-rounding or tongue-raising; such phenomena 
are generically refered to as “trading relations” in the speech psychophysics literature (Lind- 
blom 1990; Perkell et al. 1993). Once a particular acoustic pattern is observed, the causes 
become dependent; thus for example, evidence that the lips are rounded would act to discount 
inferences that the tongue has been raised. These inferences propagate forward and backward 
in time and couple the chains. Formally, these induced dependencies are accounted for by the 
links added between the state sequences during the moralization of the graph (see Figure lib). 
This figure shows that the underlying calculations for this model are closely related to those of 
the earlier HMM(1,2), but the model specification is very different in the two cases. 

Saul and Jordan (1996) have proposed a second extension of the HMM(1,1) model which 
is motivated by the desire to provide a more effective model of coarticulation (see also Stolorz, 
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Figure 11: (a) the DPIN structure for HMM(1,2) with a single observable sequence coupled to 
a pair of underlying state sequences, (b) the moralization of this DPIN structure. 
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Figure 12: The UPIN structure for HMM(3,1). 


23 


1994). In this model, shown in Figure 12, coarticulatory influences are modeled via additional 
links between output variables and states along an HMM(1,1) backbone. One approach to 
performing calculations in this model is to treat it as a K th -oiAei Markov chain, and transform 
it into an HMM(1,1) model by defining higher-order state variables. A graphical modeling 
approach is more flexible—it is possible for example to introduce links between states and 
outputs K time steps apart without introducing links for the intervening time intervals. More 
generally, the graphical modeling approach to the HMM(K,1) model allows the specification of 
different interaction matrices at different time scales; this is awkward in the I ( th -order Markov 
chain formalism. 

The HMM(3,1) graph is triangulated as is, and thus, the time complexity of the JLO 
algorithm is therefore 0(1 V/,to 3 ). In general a HMM(K,1) graph creates cliques of size 0(m K ) 
and the JLO algorithm runs in time 0(Nhm K ). 

As these examples suggest, the graphical modeling framework provides a useful framework 
for exploring extensions of hidden Markov models. The examples also make clear, however, 
that the graphical algorithms are no panacea. The m K complexity of HMM(K,1) will be 
prohibitive for large K. Also, the generalization of HMM(1,2) to HMM(1,K) (couplings of K 
chains) is intractable. Recent research has therefore focused on approximate algorithms for 
inference in such structures—see Saul and Jordan (1996) for HMM(K,1) and Ghahramani and 
Jordan (1996) and Williams and Hinton (1990) for HMM(1,K). These authors have developed 
an approximation methodology based on mean-field theory from statistical physics. While 
discussion of mean-field algorithms is beyond the scope of this paper, it is worth noting that the 
graphical modeling framework plays a useful role in the development of these approximations. 
Essentially the mean-field approach involves creating a simplified graph for which tractable 
algorithms are available, and minimizing a probabilistic distance between the tractable graph 
and the intractable graph. The JLO algorithm is called as a subroutine on the tractable graph 
during the minimization process. 

9 Learning and PINs 

9.1 Parameter Estimation for PINs 

The parameters of a graphical model can be estimated with maximum-likelihood (ML), 
maximum-a-posteriori (MAP), or full Bayesian methods, using traditional techniques such as 
gradient descent, expectation-maximization (EM) (e.g., Dempster et al., 1977), and Monte- 
Carlo sampling (e.g., Neal, 1993). For the standard HMM(1,1) model discussed in this paper, 
where either discrete, Gaussian, or Gaussian-mixture codebooks are used, a ML or MAP esti¬ 
mate using EM is a well-known efficient approach (Poritz 1988; Rabiner 1989). An important 
aspect of the application of the EM algorithm to PINs is that the JLO algorithm can be used 
to perform the E step. 

For purposes of illustration, and in keeping with the rest of the paper, let us consider 
the case where all variables in U are discrete. Let x k and pa(Xy denote the £;th state of 
variable X and jth state of variables pa( A), respectively. Suppose we have a directed HMM- 
like model M (a DPIN) with mutually independent parameters 0 = U jki^Hjk^Ojk}, where 
0Hjk = p(h k \pa(Hi)y M) and 0ojk = p(of|pa(0 8 ')-', M) for all i. In addition, suppose we have 
observed data D = {ei,..., es }, an (iid) random sample from the true distribution. 

The EM algorithm finds a local maximum of the likelihood p(D\0, M) by initializing the 
parameters 0 (e.g., at random or via some clustering algorithm) and repeating E and M steps. 
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In the E step, we compute the expected sufficient statistic for each of the parameters, given 
D and the current values for 0. Let Sjjjk be the sufficient statistic for Ojjjk- The expected 
sufficient statistic E(SHjk\D,0, M) is given by 

s 

E(S H jk\D,9,M) = EE p(/q fc ,pa(i/j) J |e ; ,0, M) 

1=1 i 

As mentioned, an important feature of the EM algorithm applied to PINs is that each term in 
the sum can be computed using the JLO algorithm. The expected sufficient statistic for dojk 
is computed similarly. In the M step, we use the expected sufficient statistics as if they were 
actual sufficient statistics, and set the new values of 0 to be those that maximize the likelihood 
of these statistics: 

E{S Hj k\D,0,M) _ E{S Ojk \D,0,M) 

H3k ZkE(S Hjk \D,d,M) ° 3k Z k E(S 0j k\D,d,M) 

The EM algorithm also can be used to find a focal maximum of the posterior probability 
p(9\D, M) (x p(D\0 , M) ■ p(0\M), where p(6\M) is the parameter prior. Priors most often used 
are conjugate distributions, such as the Dirichfet distribution for the parameters of discrete 
variables and the mixing coefficients of Gaussian-mixture codebooks, and the normal-Wishart 
distribution for the parameters of Gaussian codebooks (DeGroot 1970; Buntine 1994; Hecker- 
man and Geiger 1995). These priors have also been used in MAP estimates of standard HMMs 
(e.g., Gauvain and Lee, 1994). Heckerman and Geiger (1995) describe a simple method for 
assessing these priors. 

The use of the EM algorithm for UPINs is similar. Suppose that the undirected model M 
consists of cliques C'ij such that the parameters of Cqj and Ci 2 j are the same for any i\ and 
f 2 • That is, suppose p(C)j = c\-\M) = 6jk for all i. In addition, suppose that the parameters 
0 = U jk@jk are mutually independent. In this case, we can estimate the parameters for the 
clique marginals, and then use Jirousek’s IPF algorithm on a triangulation of M to compute 
a consistent estimate of the joint distribution. As in the directed case, we can use the JLO 
algorithm to perform the E step: 


s 

E(S jk \D,0,M) = E)EK<4'h,0,M) 

1=1 i 

9.2 Model Selection and Averaging for PINs 

In some situations it is useful to use data to guide the selection of an appropriate model. For 
example, the presence of some arcs or the number of states of a hidden variable may be in doubt. 
One solution to this problem is the Bayesian approach, in which we assign prior probabilities 
p(M) to different models, and compute their relative posterior probabilities given data: 

p(M\D) oc p(M) p(D\M) = p(M) Jp(D\0,M) p(0\M) dd (44) 

We then select the model with the highest posterior probability, or average the predictions of 
two or more models weighted by their relative posterior probabilities. 

When data is missing—for example, when some variables are hidden—the exact computation 
of the integral in Equation 44 is usually intractable. Nonetheless, simple approximations to this 
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integral exist, such as the Bayesian Information Criterion (BIC) described by Schwarz (1978): 

logp(D\M) ss logp(_D|0, M) — ^log S 

where 0 is the ML estimate, S is the number of cases in D, and d is the dimension of M — 
typically, the number of parameters of iff. The first term of this “score” for M rewards how 
well the data fits iff, whereas the second term punishes model complexity. Note that this score 
does not depend on the parameter prior, and thus can be applied easily. 1 For examples of 
applications of BIC in the context of PINs and other statistical models, see Raftery (1995). 

The BIC score is the additive inverse of Rissanen’s (1987) minimum description length 
(MDL). Other scores, which can be viewed as approximations to the marginal likelihood, are 
hypothesis testing (Raftery 1995) and cross validation (Fung and Crawford 1990). Buntine (in 
press) provides a comprehensive review of the literature on learning PINs. 

In the context of HMM(IL, J) type structures, an obvious question is how one could learn 
such structure from data, where K and J are unknown a priori. From a model identification 
viewpoint, this is an easier problem than that of learning an arbitrary PIN, because the possible 
models under consideration are highly constrained. Thus, using both the estimation techniques 
for a particular model described in the previous section (and the JLO algorithm for solving the 
E-step as described in detail earlier in the paper), and the Bayesian (and alternative) model 
selection procedures outlined above, the algorithmic prescriptions for learning such models in 
a principled fashion are already in place. 

10 Summary 

Probabilistic independence networks provide a useful framework for both the analysis and ap¬ 
plication of multivariate probability models when there is considerable structure in the model in 
the form of conditional independence. The graphical modelling approach both clarifies the inde¬ 
pendence semantics of the model and yields efficient computational algorithms for probabilistic 
inference. This paper has shown that it is useful to cast HMM structures in a graphical model 
framework. In particular, the well known F-B and Viterbi algorithms were shown to be special 
cases of more general algorithms from the graphical modelling literature. Furthermore, more 
complex HMM structures, beyond the traditional first-order model, can be analyzed profitably 
and directly using generally-applicable graphical modeling techniques. 
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Appendix 1: The Viterbi Algorithm for HMM(1,1) is a Special Case of 
Dawid’s Algorithm 

As with the inference problem, let the final clique in the chain containing (i/jv-i, -Hjv) be the 
root clique and use the same schedule, i.e., first a “left-to-right” collection phase into the root 
clique, followed by a “right-to-left” distribution phase out from the root clique. Again it is 
assumed that the junction tree has been initialized so that the potential functions are the local 
marginals, and the observable evidence e has been entered into the cliques in the same manner 
as described for the inference algorithm. 

We refer again to Figure 8: the sequence of flow and absorption operations is identical to 
that of the inference algorithm with the exception that marginalization operations are replaced 
by maximization. Thus, the potential on the separator between (Oi,Hf) and (i/j_i, Hf) is 
initially updated to 

fOi(hi) = ma xp(hi, of) = p(h t , of). (45) 

The update factor for this separator is 

XoAhi> = ififr =p(a - ]hi> ’ (46) 

and after absorption into the clique (i/j_i, Hf) one gets 

fOi(hi-i,hi) = p(/q_i,/q)p(o*|/q). (47) 
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Now consider the flow from clique 44/_i) to (i4/_i, Hi). Let 44/j = {if/,..., Hj} 

denote a set of consecutive observable variables and h* ■ = {h*,...,h*}, denote the observed 
values for these variables, 1 < i < j < N. Assume that the potential on separator has 

been updated to 

/$!,;_! 0;-i) = maxp(/i i _i,/ii i i_ 2 ,^ )i _ 1 ) (48) 

via earlier flows in the schedule. Thus, the update factor for separator becomes 


1 ) 


max/ij^j p(h 8 _i, h lj8 - 2 , // 8 _i) 

p(hi-i) 


and this gets absorbed into clique (i4/_i,i4/) to produce 


(49) 


(^b —1 •> 


/o, (^L —1 ? — — 

max/,, 

p( h t -!, /q )p( O* I hi )-- 


p(hi-i,hi t i-2i<j>i t i-i) 

p(hi-i) 


(50) 

(51) 


We can now obtain the new potential on the separator for the flow from clique (i4/_i,i4/) 

(44/, ) ? 


h lt (hi) = max/ $18 (/q_i,/q) (52) 

’ N—i 

= p(o*|/ii)max{p(/ii|/ii_i) maxp(/i i _i,/ii )i _ 2 ,^ )i _ 1 )} (53) 

h t -1 

= p(o*\hi) max{p(/i i |/ii_i)p(/ii_i,/ii )i _ 2 ,^ )i _ 1 )} (54) 

441,Z-l 

= maxp(/i i ,/ii )i _i,^ )i ) (55) 

441, t-1 


which is the result one expects for the updated potential at this clique. Thus, we can express 
the separator potential f$ 1 8 (/q) recursively (via Equation 54) as 


/<J>i ,i( h i) = p(o*\h t ) max{p(h t \h t _ 1 )U 1 ^(/q-i)}. (56) 

Hi — 1 

This is the same recursive equation as used in the S variables in the Viterbi algorithm (Equation 
33a in Rabiner (1990)): the separator potentials in Dawid’s algorithm using a left-to-right 
schedule are exactly the same as the 6's used in the Viterbi method for solving the MAP 
problem in HMM(1,1). 

Proceeding recursively in this manner one finally obtains at the root clique 


/$! jv(4lv-i, h N ) = max p(h/v-i, h N , h/v- 2 , 4>i,n) (57) 

m,N- 2 

from which one can get the likelihood of the evidence given the most likely state of the hidden 
variables: 


/(e) = max /$, w (hjv-i, h N ) 

/4jV_l?/ljV 

= maxp(hi i jv,/yjv) 

441 ,N 


(58) 

(59) 


Identification of the values of the hidden variables which maximize the evidence likelihood 
can be carried out in the standard manner as in the Viterbi method, namely by keeping a 
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pointer at each clique along the flow in the forward direction back to the previous clique and 
then backtracking along this list of pointers from the root clique after the collection phase is 
complete. An alternative approach is to use the distribute phase of the Dawid algorithm: this 
has the same effect, namely, once the distribution flows are completed, each local clique can 
calculate both the maximum value of the evidence likelihood given the hidden variables and 
the values of the hidden variables in this maximum which are local to that particular clique. 
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